Custom Report

Author

Joseph H

1 Introduction

This report is one of 493,845 that I will make, and one of 104,070,413 that could be made.
I “took” the 1.4 TB LinkedIn data that was breached in 2020 and turned it into insights to power my job HUNT.
The insights I can share in this report that also relate to my goals are:
- Industry-based recruitment trends.
- Company-based workforce timeline.
- Current/past workforce info:
  - Basic info: name, job title, status, social links. I could add geolocation for those who have the data, but it would look creepy.
  - Their work period.
  - Their experience.

2 About me

Salutations! I’m Joseph, a self-taught data analyst, engineer, and scraper.
Despite life’s challenges, my goal remains a remote job, either full-time or part-time, and friends with whom to tackle the challenges of this changing world.
To show my skills and dedication, I made this project, which yielded this tailored report.

3 About the project

3.1 How did this project come to life?

As you know by now from my email, I am hunting for a job.
About a year ago, I scraped contact info from Google Maps to get my first job. Later I scraped contacts from the LinkedIn website… you can check how that went here.

Recently, I finally got around to learning SQL because of DuckDB, a piece of software that lets you process big data on your local machine by using storage space as if it were RAM. Then I remembered a leaked LinkedIn dataset that I had not been able to process before.
Thus began my journey to learn SQL, process the data, and make something out of it.

3.2 The process

The process was done on my local machine, and it went as follows.

3.2.1 Downloaded the leaked data

I downloaded the data from a torrent.
There were around 700 .gz files, each around 280 MB; 196 GB in total.
Each .gz file contains a 2 GB file; 1.4 TB in total.
Each file has multiple lines, and each line is a JSON object. The file as a whole is not a single JSON document; it just holds many JSON objects, one per line.
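To illustrate the one-JSON-per-line layout described above, here is a minimal sketch that streams such a file straight out of its .gz archive using only the Python standard library. The function name `iter_records` and the sample fields (`id`, `name`) are hypothetical and only there for the demo; they are not taken from the real data.

```python
import gzip
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_records(archive: Path) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    # "rt" decompresses on the fly and decodes to text, so the whole
    # payload never has to sit in memory at once.
    with gzip.open(archive, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


# Demo on a tiny synthetic archive (hypothetical fields, not real data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.gz"
    with gzip.open(sample, "wt", encoding="utf-8") as handle:
        handle.write('{"id": 1, "name": "a"}\n')
        handle.write('{"id": 2, "name": "b"}\n')
    records = list(iter_records(sample))

print(len(records))
```

Parsing the whole file with a single `json.load` call would fail here, because the file is a sequence of JSON objects and not one JSON document.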

3.2.2 Processing the weird data

In this phase I created a script that automatically opens an archive, processes the file, and saves it as a Parquet file with a compression level of 22.
I used Python, Pathlib, Polars, and a lot of patience.
The process took around 20 minutes per file; in total it took around three weeks (I had to shut down my PC at night). The result was 700 Parquet files, each around 190 MB; 133 GB in total.

3.2.3 Making relational database

The data in the datasets was nested, especially the “experience” field, which held both a person’s experience and the company info. The problem is that the company info gets repeated multiple times, across all datasets.
Making a relational database solves this and makes the exploratory data analysis easier.
The code was split in two:
1. I used Polars to split each of the 700 datasets into mini relational databases.
2. I used DuckDB to merge all the mini relational databases and remove duplicates in some tables, mainly company and university information.

The result was a relational database that is 73 GB in size; from 1.4 TB down to 73 GB.

All of this was done on my PC, so no servers were harmed, only my CPU fan and my ears.

3.2.4 Filter

I filtered the companies based on their industry, country, and whether I have the email of one of the higher-ups.

4 General graphs

4.1 Market research industry’s yearly new recruit count

4.2 scan scape’s workforce status over the years

5 Workforce sample

5.1 Adam Southcott

Job title: Field service representative
Associated: True
Socials: https://linkedin.com/in/adam-southcott-295143197

5.1.1 Adam Southcott’s working period at scan scape

5.1.2 Gantt plot of Adam Southcott’s experience


5.2 Chris Manning

Job title: Field service representative
Associated: False
Socials: https://linkedin.com/in/christopher-manning-704b82b1 | https://linkedin.com/in/chris-manning-704b82b1

5.2.1 Chris Manning’s working period at scan scape

5.2.2 Gantt plot of Chris Manning’s experience


5.3 Christopher Peters

Job title: Field audit services
Associated: True
Socials: https://linkedin.com/in/christopher-peters-57a48378

5.3.1 Christopher Peters’s working period at scan scape

5.3.2 Gantt plot of Christopher Peters’s experience


5.4 Dawn Perry

Job title: Field service representative
Associated: True
Socials: https://linkedin.com/in/dawn-perry-2bb515a1 | https://facebook.com/sunshine0935

5.4.1 Dawn Perry’s working period at scan scape

5.4.2 Gantt plot of Dawn Perry’s experience


5.5 Heather Cook

Job title: Field service representative
Associated: True
Socials: https://linkedin.com/in/artist0423 | https://linkedin.com/in/hthrck | https://facebook.com/sonettie

5.5.1 Heather Cook’s working period at scan scape

5.5.2 Gantt plot of Heather Cook’s experience


5.6 Jojo Bituin

Job title: Field sales representative
Associated: True
Socials: https://linkedin.com/in/jojo-bituin-a5817b55 | https://linkedin.com/in/jojo-bituin-1985b341 | https://facebook.com/jojo.bituin.3

5.6.1 Jojo Bituin’s working period at scan scape

5.6.2 Gantt plot of Jojo Bituin’s experience


5.7 Joseph Muglia

Job title: Field service representative
Associated: True
Socials: https://facebook.com/100010541714584 | https://linkedin.com/in/joseph-muglia-302774141

5.7.1 Joseph Muglia’s working period at scan scape

5.7.2 Gantt plot of Joseph Muglia’s experience


5.8 Josh Logan

Job title: Field representative
Associated: False
Socials: https://linkedin.com/in/josh-logan-a559671 | https://twitter.com/joshlogan98 | https://facebook.com/josh.logan.9693

5.8.1 Josh Logan’s working period at scan scape

5.8.2 Gantt plot of Josh Logan’s experience


5.9 Julie Mestan

Job title: Field manager
Associated: True
Socials: https://linkedin.com/in/julie-mestan-86236680

5.9.1 Julie Mestan’s working period at scan scape

5.9.2 Gantt plot of Julie Mestan’s experience


5.10 Kenneth Reddinger

Job title: Field service representative
Associated: True
Socials: https://linkedin.com/in/kenneth-reddinger-51426343 | https://facebook.com/ken.reddinger.9 | https://linkedin.com/in/kennethreddinger

5.10.1 Kenneth Reddinger’s working period at scan scape

5.10.2 Gantt plot of Kenneth Reddinger’s experience


5.11 Kim Elliott

Job title: Field staff
Associated: True
Socials: https://linkedin.com/in/kim-elliott4425

5.11.1 Kim Elliott’s working period at scan scape

5.11.2 Gantt plot of Kim Elliott’s experience


5.12 Lori Shriner

Job title: Field service representative
Associated: True
Socials: https://facebook.com/ltotheshriner | https://linkedin.com/in/lori-shriner-5b31a81

5.12.1 Lori Shriner’s working period at scan scape

5.12.2 Gantt plot of Lori Shriner’s experience


5.13 Summer Coppage

Job title: Field representative
Associated: True
Socials: https://linkedin.com/in/summer-coppage-9122a7181

5.13.1 Summer Coppage’s working period at scan scape

5.13.2 Gantt plot of Summer Coppage’s experience


5.14 Todd Christian

Job title: Field service representative
Associated: True
Socials: https://linkedin.com/in/todd-christian-29a164134

5.14.1 Todd Christian’s working period at scan scape

5.14.2 Gantt plot of Todd Christian’s experience